
Introduction

After months of effort, we are pleased to announce the evolution from Qwen1.5 to Qwen2. This time, we bring to you:

  • Pretrained and instruction-tuned models of 5 sizes, including Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B;
  • Training on data in 27 additional languages besides English and Chinese;
  • State-of-the-art performance in a large number of benchmark evaluations;
  • Significantly improved performance in coding and mathematics;
  • Extended context length support up to 128K tokens with Qwen2-7B-Instruct and Qwen2-72B-Instruct.

We have open-sourced the models on Hugging Face and ModelScope, and we look forward to hearing from you!

Model Information

The Qwen2 series includes base and instruction-tuned models in 5 sizes: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B. The key information of these models is summarized in the following table:

| Models | Qwen2-0.5B | Qwen2-1.5B | Qwen2-7B | Qwen2-57B-A14B | Qwen2-72B |
|---|---|---|---|---|---|
| # Params | 0.49B | 1.54B | 7.07B | 57.41B | 72.71B |
| # Non-Emb Params | 0.35B | 1.31B | 5.98B | 56.32B | 70.21B |
| GQA | True | True | True | True | True |
| Tie Embedding | True | True | False | False | False |
| Context Length | 32K | 32K | 128K | 64K | 128K |

Specifically, in Qwen1.5 only Qwen1.5-32B and Qwen1.5-110B adopted Grouped Query Attention (GQA). This time, we apply GQA to all model sizes so that they benefit from faster inference and lower memory usage. For the small models, we prefer tied embeddings, as the large sparse embedding matrices account for a large proportion of the total model parameters.
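To give a rough sense of why tying matters at small scale, the sketch below estimates how much of a ~0.5B-parameter model's budget a single embedding matrix consumes. The hidden size and vocabulary size used here are illustrative assumptions, not official figures; the gap between # Params and # Non-Emb Params in the table above is roughly this one matrix.

```python
# Rough estimate of how much of a small model's parameter budget the
# embeddings consume, and what weight tying saves.
# hidden_size and vocab_size are illustrative assumptions, not official values.

hidden_size = 896          # assumed hidden dimension for a ~0.5B model
vocab_size = 152_000       # assumed vocabulary size
total_params = 0.49e9      # total parameters reported for Qwen2-0.5B

embedding_params = hidden_size * vocab_size   # one embedding matrix
# Without tying, a separate LM head would add roughly the same amount again.

print(f"One embedding matrix: {embedding_params / 1e6:.0f}M parameters")
print(f"Share of total:       {embedding_params / total_params:.1%}")
print(f"Saved by tying input embedding and LM head: ~{embedding_params / 1e6:.0f}M parameters")
```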

In terms of the context length, all base language models have been pretrained on data of the context length of 32K tokens, and we observe satisfactory extrapolation capabilities up to 128K in PPL evaluation. However, for instruction-tuned models, we are not satisfied with merely PPL evaluation; we need the models to be capable of correctly understanding long context and completing tasks. In the table, we list the context length capabilities of instruction-tuned models, as assessed through the evaluation of the Needle in a Haystack task. Notably, when augmented with YARN, both Qwen2-7B-Instruct and Qwen2-72B-Instruct models demonstrate an impressive capacity to handle context lengths extending up to 128K tokens.

Significant efforts were directed towards augmenting both the volume and quality of the pretraining and instruction-tuning datasets across a diverse linguistic spectrum beyond English and Chinese, to bolster the models’ multilingual competencies. Although large language models have an inherent capacity to generalize to other languages, we explicitly highlight the inclusion of 27 additional languages in our training:

| Regions | Languages |
|---|---|
| Western Europe | German, French, Spanish, Portuguese, Italian, Dutch |
| Eastern & Central Europe | Russian, Czech, Polish |
| Middle East | Arabic, Persian, Hebrew, Turkish |
| Eastern Asia | Japanese, Korean |
| South-Eastern Asia | Vietnamese, Thai, Indonesian, Malay, Lao, Burmese, Cebuano, Khmer, Tagalog |
| Southern Asia | Hindi, Bengali, Urdu |

Additionally, we have devoted significant effort to addressing code-switching, a frequent occurrence in multilingual evaluation. Consequently, our models’ proficiency in handling this phenomenon has improved notably. Evaluations using prompts that typically induce code-switching across languages confirm a substantial reduction in associated issues.

Performance

Comparative assessments reveal substantial performance enhancements for large-scale models (70B+ parameters) relative to Qwen1.5. Here our evaluation centers on the large-size model Qwen2-72B. In terms of base language models, Qwen2-72B and state-of-the-art open models are evaluated on a range of capabilities, including natural language understanding, knowledge acquisition, coding proficiency, mathematical skills, and multilingual abilities. Benefiting from meticulously curated datasets and optimized training methods, Qwen2-72B exhibits superior performance compared to leading models such as Llama-3-70B. Notably, it surpasses its predecessor, Qwen1.5-110B, despite having fewer parameters.

After extensive large-scale pre-training, we conduct post-training to further enhance Qwen’s intelligence, bringing it closer to humans. This process further improves the model’s capabilities in areas such as coding, mathematics, reasoning, instruction following, and multilingual understanding. Additionally, it aligns the model’s output with human values, ensuring that it is helpful, honest, and harmless. Our post-training phase is designed with the principle of scalable training with minimal human annotation. Specifically, we investigate how to obtain high-quality, reliable, diverse, and creative demonstration data and preference data with various automated alignment strategies, such as rejection sampling for math, execution feedback for coding and instruction following, back-translation for creative writing, and scalable oversight for role-play. As for training, we apply a combination of supervised fine-tuning, reward model training, and online DPO training. We also employ a novel Online Merging Optimizer to minimize the alignment tax. These collective efforts have significantly boosted the capabilities and intelligence of our models, as illustrated in the following table.
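To make the “rejection sampling for math” strategy concrete, here is a minimal sketch of how such demonstration data could be collected; `generate` and `extract_final_answer` are hypothetical placeholders rather than part of any Qwen API, and the actual pipeline is more sophisticated than this.

```python
# Minimal sketch of rejection sampling for math demonstration data.
# `generate` and `extract_final_answer` are hypothetical callables supplied
# by the caller; the real Qwen2 pipeline is more involved than this sketch.

def rejection_sample(problem: str, reference_answer: str,
                     generate, extract_final_answer,
                     num_samples: int = 16) -> list[str]:
    """Keep only sampled solutions whose final answer matches the reference."""
    accepted = []
    for _ in range(num_samples):
        solution = generate(problem)                      # sample one chain-of-thought solution
        if extract_final_answer(solution) == reference_answer:
            accepted.append(solution)                     # correct reasoning traces become SFT data
    return accepted
```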

We comprehensively evaluate Qwen2-72B-Instruct on 16 benchmarks across various domains. Qwen2-72B-Instruct strikes a balance between obtaining better capabilities and aligning well with human values. Specifically, Qwen2-72B-Instruct significantly surpasses Qwen1.5-72B-Chat across all benchmarks, and also reaches competitive performance compared with Llama-3-70B-Instruct.1

In terms of smaller models, our Qwen2 models also outperform SOTA models of similar or even larger sizes. Compared with recently released SOTA models, Qwen2-7B-Instruct still demonstrates advantages across benchmarks, with particularly outstanding performance on coding and Chinese-related metrics.1

Highlights

Coding & Mathematics

We have persistently dedicated our efforts to enhancing the advanced capabilities of Qwen, particularly in coding and mathematics. In coding, we have successfully integrated the code training experience and data from CodeQwen1.5, resulting in significant improvements in Qwen2-72B-Instruct across various programming languages. In mathematics, by exploiting extensive and high-quality datasets, Qwen2-72B-Instruct demonstrates stronger capabilities in solving mathematical problems.

Long Context Understanding

In Qwen2, all instruction-tuned models have been trained on 32k length contexts, and extrapolated to longer context lengths using techniques like YARN or Dual Chunk Attention.
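For reference, enabling YaRN-style extrapolation for deployment typically amounts to adding a `rope_scaling` entry to a local copy of the model’s `config.json`. The key names and factor in the sketch below follow the common Hugging Face / vLLM convention at the time of writing and are assumptions on our part; please check the official Qwen2 documentation for the recommended settings.

```python
# Sketch: enable YaRN rope scaling in a local copy of the model's config.json
# so a 32K-pretrained checkpoint can be served with a longer context window.
# Key names and the scaling factor follow the common Hugging Face / vLLM
# convention; consult the official Qwen2 documentation for exact values.
import json

config_path = "Qwen2-7B-Instruct/config.json"   # path to a local checkpoint (assumed)

with open(config_path) as f:
    config = json.load(f)

config["rope_scaling"] = {
    "type": "yarn",
    "factor": 4.0,                                # 32K * 4 = 128K target context
    "original_max_position_embeddings": 32768,
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```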

The figure below shows our test results on the Needle in a Haystack task. Notably, Qwen2-72B-Instruct is capable of flawlessly handling information extraction tasks within a 128k context. Coupled with its inherently strong performance, it becomes the preferred choice for long-text tasks when resources are sufficient.

Additionally, it’s worth noting the impressive capabilities of other models in the series: Qwen2-7B-Instruct nearly flawlessly handles contexts up to 128k in length, Qwen2-57B-A14B-Instruct manages contexts up to 64k, and the two smaller models in the lineup support contexts of 32k.

Alongside the long-context models, we have also open-sourced an agent solution for efficiently processing documents containing up to 1 million tokens. For more details, see our dedicated blog post on this topic.

Safety and Responsibility

The table below presents the proportion of harmful responses generated by large models for four categories of multilingual unsafe queries (Illegal Activity, Fraud, Pornography, Privacy Violence). The test data was derived from Jailbreak and translated into multiple languages for evaluation. We find that Llama-3 does not effectively handle multilingual prompts and is therefore not included in the comparison. Through significance testing (p-value), we found that the Qwen2-72B-Instruct model performs comparably to GPT-4 in terms of safety, and significantly outperforms the Mistral-8x22B model (a minimal sketch of such a test is shown after the table).

Each cell reports GPT-4 / Mistral-8x22B / Qwen2-72B-Instruct.

| Language | Illegal Activity | Fraud | Pornography | Privacy Violence |
|---|---|---|---|---|
| zh | 0% / 13% / 0% | 0% / 17% / 0% | 43% / 47% / 53% | 0% / 10% / 0% |
| en | 0% / 7% / 0% | 0% / 23% / 0% | 37% / 67% / 63% | 0% / 27% / 3% |
| ar | 0% / 13% / 0% | 0% / 7% / 0% | 15% / 26% / 15% | 3% / 13% / 0% |
| es | 0% / 7% / 0% | 3% / 0% / 0% | 48% / 64% / 50% | 3% / 7% / 3% |
| fr | 0% / 3% / 0% | 3% / 3% / 7% | 3% / 19% / 7% | 0% / 27% / 0% |
| ko | 0% / 4% / 0% | 3% / 8% / 4% | 17% / 29% / 10% | 0% / 26% / 4% |
| pt | 0% / 7% / 0% | 3% / 7% / 3% | 47% / 57% / 47% | 4% / 26% / 4% |
| th | 0% / 10% / 0% | 7% / 23% / 3% | 13% / 17% / 10% | 13% / 7% / 7% |
| vi | 0% / 4% / 0% | 4% / 11% / 0% | 22% / 26% / 22% | 0% / 0% / 0% |
| Average | 0% / 8% / 0% | 3% / 11% / 2% | 27% / 39% / 31% | 3% / 16% / 2% |
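As an illustration of this kind of comparison, the sketch below runs a two-sided Fisher exact test on harmful-response counts for two models. The counts are hypothetical stand-ins, since the per-language sample sizes behind the percentages above are not reported here.

```python
# Illustrative significance test comparing harmful-response rates of two models.
# The counts below are hypothetical stand-ins; the per-language sample sizes
# behind the table's percentages are not reported in this post.
from scipy.stats import fisher_exact

n_prompts = 30                      # assumed number of unsafe prompts per cell
harmful_a, harmful_b = 1, 5         # hypothetical harmful responses for model A and model B

table = [
    [harmful_a, n_prompts - harmful_a],
    [harmful_b, n_prompts - harmful_b],
]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p-value = {p_value:.3f}")
```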

Developing with Qwen2

All models have now been released on Hugging Face and ModelScope. Feel free to visit the model cards for detailed usage instructions and to learn more about each model, including its features, performance, and more.
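As a minimal sketch of loading one of the instruction-tuned checkpoints with Hugging Face Transformers (a recent `transformers` release and the `accelerate` package are assumed; see the model cards for the recommended setup):

```python
# Minimal chat example with Qwen2-7B-Instruct via Hugging Face Transformers.
# Assumes a recent `transformers` release and `accelerate` for device_map="auto".
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens and decode only the newly generated continuation.
response = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```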

Many friends have long supported the development of Qwen, including finetuning (Axolotl, Llama-Factory, Firefly, Swift, XTuner), quantization (AutoGPTQ, AutoAWQ, Neural Compressor), deployment (vLLM, SGL, SkyPilot, TensorRT-LLM, OpenVino, TGI), API platforms (Together, Fireworks, OpenRouter), local run (MLX, Llama.cpp, Ollama, LM Studio), agent and RAG frameworks (LlamaIndex, CrewAI, OpenDevin), evaluation (LMSys, OpenCompass, Open LLM Leaderboard), model training (Dolphin, Openbuddy), etc. For how to use Qwen2 with third-party frameworks, please refer to their respective documentation as well as our official documentation.
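For deployment frameworks that expose an OpenAI-compatible endpoint (vLLM, for example), a chat call can look like the sketch below. The server address and model name assume a server you have already launched yourself; check the framework’s documentation for the exact launch command.

```python
# Querying a locally served Qwen2 model through an OpenAI-compatible API
# (e.g., one exposed by vLLM). The base_url and model name are assumptions
# about your own deployment; adjust them to match it.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2-7B-Instruct",
    messages=[
        {"role": "user", "content": "Summarize the main features of Qwen2 in one sentence."},
    ],
)
print(completion.choices[0].message.content)
```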

There are still many teams and individuals not mentioned here who have contributed to Qwen. We sincerely thank them for their support, and we hope that our collaboration can boost the research and development of the open-source AI community.

License

This time, we are changing the licenses of our models. While Qwen2-72B and its instruction-tuned models still use the original Qianwen License, all other models, including Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, and Qwen2-57B-A14B, now adopt Apache 2.0! We believe that this enhanced openness to the community can accelerate the applications and commercial use of Qwen2 all around the world.

What’s Next for Qwen2?

We are training larger Qwen2 models to further explore model scaling along with our recent data scaling. Additionally, we are extending the Qwen2 language models to multimodal models capable of understanding both visual and audio information. In the near future, we will continue to open-source new models to accelerate open-source AI. Stay tuned!

Citation

If you find our work helpful, feel free to cite us!

@article{qwen2,
      title={Qwen2 Technical Report}, 
      author={An Yang and Baosong Yang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Zhou and Chengpeng Li and Chengyuan Li and Dayiheng Liu and Fei Huang and Guanting Dong and Haoran Wei and Huan Lin and Jialong Tang and Jialin Wang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Ma and Jin Xu and Jingren Zhou and Jinze Bai and Jinzheng He and Junyang Lin and Kai Dang and Keming Lu and Keqin Chen and Kexin Yang and Mei Li and Mingfeng Xue and Na Ni and Pei Zhang and Peng Wang and Ru Peng and Rui Men and Ruize Gao and Runji Lin and Shijie Wang and Shuai Bai and Sinan Tan and Tianhang Zhu and Tianhao Li and Tianyu Liu and Wenbin Ge and Xiaodong Deng and Xiaohuan Zhou and Xingzhang Ren and Xinyu Zhang and Xipin Wei and Xuancheng Ren and Yang Fan and Yang Yao and Yichang Zhang and Yu Wan and Yunfei Chu and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zhihao Fan},
      journal={arXiv preprint arXiv:2407.10671},
      year={2024}
}



Appendix

Base Language Model Evaluation

The evaluation of base models mainly focuses on the model performance of natural language understanding, general question answering, coding, mathematics, scientific knowledge, reasoning, multilingual capability, etc.

The datasets for evaluation include:

English Tasks: MMLU (5-shot), MMLU-Pro (5-shot), GPQA (5-shot), Theorem QA (5-shot), BBH (3-shot), HellaSwag (10-shot), Winogrande (5-shot), TruthfulQA (0-shot), ARC-C (25-shot)

Coding Tasks: EvalPlus (0-shot) (HumanEval, MBPP, HumanEval+, MBPP+), MultiPL-E (0-shot) (Python, C++, JAVA, PHP, TypeScript, C#, Bash, JavaScript)

Math Tasks: GSM8K (4-shot), MATH (4-shot)

Chinese Tasks: C-Eval (5-shot), CMMLU (5-shot)

Multilingual Tasks: Multi-Exam (M3Exam 5-shot, IndoMMLU 3-shot, ruMMLU 5-shot, mMMLU 5-shot), Multi-Understanding (BELEBELE 5-shot, XCOPA 5-shot, XWinograd 5-shot, XStoryCloze 0-shot, PAWS-X 5-shot), Multi-Mathematics (MGSM 8-shot), Multi-Translation (Flores-101 5-shot)

Qwen2-72B performance

| Datasets | DeepSeek-V2 | Mixtral-8x22B | Llama-3-70B | Qwen1.5-72B | Qwen1.5-110B | Qwen2-72B |
|---|---|---|---|---|---|---|
| Architecture | MoE | MoE | Dense | Dense | Dense | Dense |
| #Activated Params | 21B | 39B | 70B | 72B | 110B | 72B |
| #Params | 236B | 140B | 70B | 72B | 110B | 72B |
| **English** | | | | | | |
| MMLU | 78.5 | 77.8 | 79.5 | 77.5 | 80.4 | 84.2 |
| MMLU-Pro | - | 49.5 | 52.8 | 45.8 | 49.4 | 55.6 |
| GPQA | - | 34.3 | 36.3 | 36.3 | 35.9 | 37.9 |
| Theorem QA | - | 35.9 | 32.3 | 29.3 | 34.9 | 43.1 |
| BBH | 78.9 | 78.9 | 81.0 | 65.5 | 74.8 | 82.4 |
| HellaSwag | 87.8 | 88.7 | 88.0 | 86.0 | 87.5 | 87.6 |
| Winogrande | 84.8 | 85.0 | 85.3 | 83.0 | 83.5 | 85.1 |
| ARC-C | 70.0 | 70.7 | 68.8 | 65.9 | 69.6 | 68.9 |
| TruthfulQA | 42.2 | 51.0 | 45.6 | 59.6 | 49.6 | 54.8 |
| **Coding** | | | | | | |
| HumanEval | 45.7 | 46.3 | 48.2 | 46.3 | 54.3 | 64.6 |
| MBPP | 73.9 | 71.7 | 70.4 | 66.9 | 70.9 | 76.9 |
| EvalPlus | 55.0 | 54.1 | 54.8 | 52.9 | 57.7 | 65.4 |
| MultiPL-E | 44.4 | 46.7 | 46.3 | 41.8 | 52.7 | 59.6 |
| **Mathematics** | | | | | | |
| GSM8K | 79.2 | 83.7 | 83.0 | 79.5 | 85.4 | 89.5 |
| MATH | 43.6 | 41.7 | 42.5 | 34.1 | 49.6 | 51.1 |
| **Chinese** | | | | | | |
| C-Eval | 81.7 | 54.6 | 65.2 | 84.1 | 89.1 | 91.0 |
| CMMLU | 84.0 | 53.4 | 67.2 | 83.5 | 88.3 | 90.1 |
| **Multilingual** | | | | | | |
| Multi-Exam | 67.5 | 63.5 | 70.0 | 66.4 | 75.6 | 76.6 |
| Multi-Understanding | 77.0 | 77.7 | 79.9 | 78.2 | 78.2 | 80.7 |
| Multi-Mathematics | 58.8 | 62.9 | 67.1 | 61.7 | 64.4 | 76.0 |
| Multi-Translation | 36.0 | 23.3 | 38.0 | 35.6 | 36.2 | 37.8 |

Qwen2-57B-A14B

| Datasets | Jamba | Mixtral-8x7B | Yi-1.5-34B | Qwen1.5-32B | Qwen2-57B-A14B |
|---|---|---|---|---|---|
| Architecture | MoE | MoE | Dense | Dense | MoE |
| #Activated Params | 12B | 12B | 34B | 32B | 14B |
| #Params | 52B | 47B | 34B | 32B | 57B |
| **English** | | | | | |
| MMLU | 67.4 | 71.8 | 77.1 | 74.3 | 76.5 |
| MMLU-Pro | - | 41.0 | 48.3 | 44.0 | 43.0 |
| GPQA | - | 29.2 | - | 30.8 | 34.3 |
| Theorem QA | - | 23.2 | - | 28.8 | 33.5 |
| BBH | 45.4 | 50.3 | 76.4 | 66.8 | 67.0 |
| HellaSwag | 87.1 | 86.5 | 85.9 | 85.0 | 85.2 |
| Winogrande | 82.5 | 81.9 | 84.9 | 81.5 | 79.5 |
| ARC-C | 64.4 | 66.0 | 65.6 | 63.6 | 64.1 |
| TruthfulQA | 46.4 | 51.1 | 53.9 | 57.4 | 57.7 |
| **Coding** | | | | | |
| HumanEval | 29.3 | 37.2 | 46.3 | 43.3 | 53.0 |
| MBPP | - | 63.9 | 65.5 | 64.2 | 71.9 |
| EvalPlus | - | 46.4 | 51.9 | 50.4 | 57.2 |
| MultiPL-E | - | 39.0 | 39.5 | 38.5 | 49.8 |
| **Mathematics** | | | | | |
| GSM8K | 59.9 | 62.5 | 82.7 | 76.8 | 80.7 |
| MATH | - | 30.8 | 41.7 | 36.1 | 43.0 |
| **Chinese** | | | | | |
| C-Eval | - | - | - | 83.5 | 87.7 |
| CMMLU | - | - | 84.8 | 82.3 | 88.5 |
| **Multilingual** | | | | | |
| Multi-Exam | - | 56.1 | 58.3 | 61.6 | 65.5 |
| Multi-Understanding | - | 70.7 | 73.9 | 76.5 | 77.0 |
| Multi-Mathematics | - | 45.0 | 49.3 | 56.1 | 62.3 |
| Multi-Translation | - | 29.8 | 30.0 | 33.5 | 34.5 |

Qwen2-7B

| Datasets | Mistral-7B | Gemma-7B | Llama-3-8B | Qwen1.5-7B | Qwen2-7B |
|---|---|---|---|---|---|
| # Params | 7.2B | 8.5B | 8.0B | 7.7B | 7.6B |
| # Non-emb Params | 7.0B | 7.8B | 7.0B | 6.5B | 6.5B |
| **English** | | | | | |
| MMLU | 64.2 | 64.6 | 66.6 | 61.0 | 70.3 |
| MMLU-Pro | 30.9 | 33.7 | 35.4 | 29.9 | 40.0 |
| GPQA | 24.7 | 25.7 | 25.8 | 26.7 | 31.8 |
| Theorem QA | 19.2 | 21.5 | 22.1 | 14.2 | 31.1 |
| BBH | 56.1 | 55.1 | 57.7 | 40.2 | 62.6 |
| HellaSwag | 83.2 | 82.2 | 82.1 | 78.5 | 80.7 |
| Winogrande | 78.4 | 79.0 | 77.4 | 71.3 | 77.0 |
| ARC-C | 60.0 | 61.1 | 59.3 | 54.2 | 60.6 |
| TruthfulQA | 42.2 | 44.8 | 44.0 | 51.1 | 54.2 |
| **Coding** | | | | | |
| HumanEval | 29.3 | 37.2 | 33.5 | 36.0 | 51.2 |
| MBPP | 51.1 | 50.6 | 53.9 | 51.6 | 65.9 |
| EvalPlus | 36.4 | 39.6 | 40.3 | 40.0 | 54.2 |
| MultiPL-E | 29.4 | 29.7 | 22.6 | 28.1 | 46.3 |
| **Mathematics** | | | | | |
| GSM8K | 52.2 | 46.4 | 56.0 | 62.5 | 79.9 |
| MATH | 13.1 | 24.3 | 20.5 | 20.3 | 44.2 |
| **Chinese** | | | | | |
| C-Eval | 47.4 | 43.6 | 49.5 | 74.1 | 83.2 |
| CMMLU | - | - | 50.8 | 73.1 | 83.9 |
| **Multilingual** | | | | | |
| Multi-Exam | 47.1 | 42.7 | 52.3 | 47.7 | 59.2 |
| Multi-Understanding | 63.3 | 58.3 | 68.6 | 67.6 | 72.0 |
| Multi-Mathematics | 26.3 | 39.1 | 36.3 | 37.3 | 57.5 |
| Multi-Translation | 23.3 | 31.2 | 31.9 | 28.4 | 31.5 |

Qwen2-0.5B & Qwen2-1.5B

| Datasets | Phi-2 | Gemma-2B | MiniCPM | Qwen1.5-1.8B | Qwen2-0.5B | Qwen2-1.5B |
|---|---|---|---|---|---|---|
| #Non-Emb Params | 2.5B | 2.0B | 2.4B | 1.3B | 0.35B | 1.3B |
| MMLU | 52.7 | 42.3 | 53.5 | 46.8 | 45.4 | 56.5 |
| MMLU-Pro | - | 15.9 | - | - | 14.7 | 21.8 |
| Theorem QA | - | - | - | - | 8.9 | 15.0 |
| HumanEval | 47.6 | 22.0 | 50.0 | 20.1 | 22.0 | 31.1 |
| MBPP | 55.0 | 29.2 | 47.3 | 18.0 | 22.0 | 37.4 |
| GSM8K | 57.2 | 17.7 | 53.8 | 38.4 | 36.5 | 58.5 |
| MATH | 3.5 | 11.8 | 10.2 | 10.1 | 10.7 | 21.7 |
| BBH | 43.4 | 35.2 | 36.9 | 24.2 | 28.4 | 37.2 |
| HellaSwag | 73.1 | 71.4 | 68.3 | 61.4 | 49.3 | 66.6 |
| Winogrande | 74.4 | 66.8 | - | 60.3 | 56.8 | 66.2 |
| ARC-C | 61.1 | 48.5 | - | 37.9 | 31.5 | 43.9 |
| TruthfulQA | 44.5 | 33.1 | - | 39.4 | 39.7 | 45.9 |
| C-Eval | 23.4 | 28.0 | 51.1 | 59.7 | 58.2 | 70.6 |
| CMMLU | 24.2 | - | 51.1 | 57.8 | 55.1 | 70.3 |

Instruction-tuned Model Evaluation1

Qwen2-72B-Instruct

| Datasets | Llama-3-70B-Instruct | Qwen1.5-72B-Chat | Qwen2-72B-Instruct |
|---|---|---|---|
| **English** | | | |
| MMLU | 82.0 | 75.6 | 82.3 |
| MMLU-Pro | 56.2 | 51.7 | 64.4 |
| GPQA | 41.9 | 39.4 | 42.4 |
| TheoremQA | 42.5 | 28.8 | 44.4 |
| MT-Bench | 8.95 | 8.61 | 9.12 |
| Arena-Hard | 41.1 | 36.1 | 48.1 |
| IFEval (Prompt Strict-Acc.) | 77.3 | 55.8 | 77.6 |
| **Coding** | | | |
| HumanEval | 81.7 | 71.3 | 86.0 |
| MBPP | 82.3 | 71.9 | 80.2 |
| MultiPL-E | 63.4 | 48.1 | 69.2 |
| EvalPlus | 75.2 | 66.9 | 79.0 |
| LiveCodeBench | 29.3 | 17.9 | 35.7 |
| **Mathematics** | | | |
| GSM8K | 93.0 | 82.7 | 91.1 |
| MATH | 50.4 | 42.5 | 59.7 |
| **Chinese** | | | |
| C-Eval | 61.6 | 76.1 | 83.8 |
| AlignBench | 7.42 | 7.28 | 8.27 |

Qwen2-57B-A14B-Instruct

| Datasets | Mixtral-8x7B-Instruct-v0.1 | Yi-1.5-34B-Chat | Qwen1.5-32B-Chat | Qwen2-57B-A14B-Instruct |
|---|---|---|---|---|
| Architecture | MoE | Dense | Dense | MoE |
| #Activated Params | 12B | 34B | 32B | 14B |
| #Params | 47B | 34B | 32B | 57B |
| **English** | | | | |
| MMLU | 71.4 | 76.8 | 74.8 | 75.4 |
| MMLU-Pro | 43.3 | 52.3 | 46.4 | 52.8 |
| GPQA | - | - | 30.8 | 34.3 |
| TheoremQA | - | - | 30.9 | 33.1 |
| MT-Bench | 8.30 | 8.50 | 8.30 | 8.55 |
| **Coding** | | | | |
| HumanEval | 45.1 | 75.2 | 68.3 | 79.9 |
| MBPP | 59.5 | 74.6 | 67.9 | 70.9 |
| MultiPL-E | - | - | 50.7 | 66.4 |
| EvalPlus | 48.5 | - | 63.6 | 71.6 |
| LiveCodeBench | 12.3 | - | 15.2 | 25.5 |
| **Mathematics** | | | | |
| GSM8K | 65.7 | 90.2 | 83.6 | 79.6 |
| MATH | 30.7 | 50.1 | 42.4 | 49.1 |
| **Chinese** | | | | |
| C-Eval | - | - | 76.7 | 80.5 |
| AlignBench | 5.70 | 7.20 | 7.19 | 7.36 |

Qwen2-7B-Instruct

| Datasets | Llama-3-8B-Instruct | Yi-1.5-9B-Chat | GLM-4-9B-Chat | Qwen1.5-7B-Chat | Qwen2-7B-Instruct |
|---|---|---|---|---|---|
| **English** | | | | | |
| MMLU | 68.4 | 69.5 | 72.4 | 59.5 | 70.5 |
| MMLU-Pro | 41.0 | - | - | 29.1 | 44.1 |
| GPQA | 34.2 | - | - | 27.8 | 25.3 |
| TheoremQA | 23.0 | - | - | 14.1 | 25.3 |
| MT-Bench | 8.05 | 8.20 | 8.35 | 7.60 | 8.41 |
| **Coding** | | | | | |
| HumanEval | 62.2 | 66.5 | 71.8 | 46.3 | 79.9 |
| MBPP | 67.9 | - | - | 48.9 | 67.2 |
| MultiPL-E | 48.5 | - | - | 27.2 | 59.1 |
| EvalPlus | 60.9 | - | - | 44.8 | 70.3 |
| LiveCodeBench | 17.3 | - | - | 6.0 | 26.6 |
| **Mathematics** | | | | | |
| GSM8K | 79.6 | 84.8 | 79.6 | 60.3 | 82.3 |
| MATH | 30.0 | 47.7 | 50.6 | 23.2 | 49.6 |
| **Chinese** | | | | | |
| C-Eval | 45.9 | - | 75.6 | 67.3 | 77.2 |
| AlignBench | 6.20 | 6.90 | 7.01 | 6.20 | 7.21 |

Qwen2-0.5B-Instruct & Qwen2-1.5B-Instruct

| Datasets | Qwen1.5-0.5B-Chat | Qwen2-0.5B-Instruct | Qwen1.5-1.8B-Chat | Qwen2-1.5B-Instruct |
|---|---|---|---|---|
| MMLU | 35.0 | 37.9 | 43.7 | 52.4 |
| HumanEval | 9.1 | 17.1 | 25.0 | 37.8 |
| GSM8K | 11.3 | 40.1 | 35.3 | 61.6 |
| C-Eval | 37.2 | 45.2 | 55.3 | 63.8 |
| IFEval (Prompt Strict-Acc.) | 14.6 | 20.0 | 16.8 | 29.0 |

Multilingual capability of instruction-tuned models

We compare Qwen2 instruction-tuned models with other recent LLMs on several cross-lingual open benchmarks as well as by human evaluation. For benchmarks, we show the results on 2 evaluation datasets:

  • M-MMLU from Okapi: multilingual commonsense evaluation (we evaluate with a subset on ar, de, es, fr, it, nl, ru, uk, vi, zh)
  • MGSM: math evaluation on languages including de, en, es, fr, ja, ru, th, zh and bn

The results are averaged over languages for each benchmark and shown as follows:

| Models | M-MMLU (5-shot) | MGSM (0-shot, CoT) |
|---|---|---|
| **Proprietary LLMs** | | |
| GPT-4-0613 | 78.0 | 87.0 |
| GPT-4-Turbo-0409 | 79.3 | 90.5 |
| GPT-4o-0513 | 83.2 | 89.6 |
| Claude-3-Opus-20240229 | 80.1 | 91.0 |
| Claude-3-Sonnet-20240229 | 71.0 | 85.6 |
| **Open-source LLMs** | | |
| command-r-plus-110b | 65.5 | 63.5 |
| Qwen1.5-7B-Chat | 50.0 | 37.0 |
| Qwen1.5-32B-Chat | 65.0 | 65.0 |
| Qwen1.5-72B-Chat | 68.4 | 71.7 |
| Qwen2-7B-Instruct | 60.0 | 57.0 |
| Qwen2-57B-A14B-Instruct | 68.0 | 74.0 |
| Qwen2-72B-Instruct | 78.0 | 86.6 |

For human evaluation, we compare Qwen2-72B-Instruct with GPT-3.5, GPT-4, and Claude-3-Opus using an in-house evaluation set, which covers 10 languages: ar, es, fr, ko, th, vi, pt, id, ja, and ru (scores range from 1 to 5):

| Models | ar | es | fr | ko | th | vi | pt | id | ja | ru | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude-3-Opus-20240229 | 4.15 | 4.31 | 4.23 | 4.23 | 4.01 | 3.98 | 4.09 | 4.40 | 3.85 | 4.25 | 4.15 |
| GPT-4o-0513 | 3.55 | 4.26 | 4.16 | 4.40 | 4.09 | 4.14 | 3.89 | 4.39 | 3.72 | 4.32 | 4.09 |
| GPT-4-Turbo-0409 | 3.44 | 4.08 | 4.19 | 4.24 | 4.11 | 3.84 | 3.86 | 4.09 | 3.68 | 4.27 | 3.98 |
| Qwen2-72B-Instruct | 3.86 | 4.10 | 4.01 | 4.14 | 3.75 | 3.91 | 3.97 | 3.83 | 3.63 | 4.15 | 3.93 |
| GPT-4-0613 | 3.55 | 3.92 | 3.94 | 3.87 | 3.83 | 3.95 | 3.55 | 3.77 | 3.06 | 3.63 | 3.71 |
| GPT-3.5-Turbo-1106 | 2.52 | 4.07 | 3.47 | 2.37 | 3.38 | 2.90 | 3.37 | 3.56 | 2.75 | 3.24 | 3.16 |

Grouped by task types, the results are shown as follows:

| Models | Knowledge | Understanding | Creation | Math |
|---|---|---|---|---|
| Claude-3-Opus-20240229 | 3.64 | 4.45 | 4.42 | 3.81 |
| GPT-4o-0513 | 3.76 | 4.35 | 4.45 | 3.53 |
| GPT-4-Turbo-0409 | 3.42 | 4.29 | 4.35 | 3.58 |
| Qwen2-72B-Instruct | 3.41 | 4.07 | 4.36 | 3.61 |
| GPT-4-0613 | 3.42 | 4.09 | 4.10 | 3.32 |
| GPT-3.5-Turbo-1106 | 3.37 | 3.67 | 3.89 | 2.97 |

These results demonstrate the strong multilingual capabilities of Qwen2 instruction-tuned models.


  1. Update on 2024-07-16: The results of instruction-tuned models may differ from those presented in the technical report; in case of any discrepancy, the results documented in the technical report should take precedence.